will estimate distribution characteristics for data sets of any size having up to 100 sample points. Twenty-nine statistics (see below for detail) are computed and used to classify each distribution into one of 24 cells of Tail Weight and Symmetry/Asymmetry as defined in Micceri (1989). Two robust variance estimates are also computed. A minimum of 20 observations is required to properly compute all statistics. The larger the sample size, the better the estimates. This template has been developed to work on three platforms:
MS-DOS compatibles using Lotus 1-2-3, Excel or any application
capable of using WK1 files;
MACINTOSH using Excel or any spreadsheet capable of working with
Excel files;
AMIGA using Advantage and CrossDOS.
If you received the incorrect form, please send me a message.
*************************
TOPICS
*************************
PRIMARY CONCERNS
SYMMETRY/TAIL WEIGHT CATEGORIES
SYMMETRY ESTIMATES
TAIL WEIGHT ESTIMATES
VARIANCE ESTIMATES
OTHER ESTIMATES
*************************
PRIMARY CONCERNS
*************************
Operating on the assumption that the nature of a score distribution influences which statistics are appropriate, the statistical findings, and the interpretations allowable for a data set, this program was written to describe data sets comprehensively and to classify them into categories that allow one to estimate how robust various statistics will be when applied to those data. Category levels are: Symmetry (from 1-Relatively Symmetric to 4-Exponential or Greater) and Tail Weight (from 1-Uniform through 3-About Gaussian to 6-Double Exponential or Greater).
Authors too numerous to mention have noted that most statistics are influenced by the distributional characteristics of the data to which they are applied. However, most of these authors investigated data sets falling beyond Tail Weight Category 6 (Double Exponential) or at or beyond Symmetry Category 4 (Exponential). In fact, most data sets in all fields do not fall into those categories.
My research and evaluation of the robustness literature suggest that distributions falling in asymmetry categories 1 or 2 and below Tail Weight category 6 should not prove too damaging to OLS-based statistics. Even distributions falling into asymmetry category 3 are often not very dangerous. However, one should be quite careful about interpretation for any distribution falling outside of Symmetry Cell 1 and Tail Weight Cell 3 (About Gaussian).
*************************
SYMMETRY/TAIL WEIGHT CATEGORIES
*************************
Four levels of symmetry/asymmetry and six tail-weight categories are defined as:
------------------------------------------------------------
LEVEL  SKEWNESS                 TAIL WEIGHT
------------------------------------------------------------
  1    Relatively Symmetric     Uniform
  2    Moderate Asymmetry       Lighter than Gaussian
  3    Extreme Asymmetry        About Gaussian
  4    Exponential or Greater   Moderately Heavy Tailed
  5                             Extremely Heavy Tailed
  6                             Double Exponential or Greater
------------------------------------------------------------
By crossing these categories, twenty-four cells are defined, ranging from (1,1), relatively symmetric with uniform tail weight, to (4,6), exponential or greater asymmetry with double exponential or greater tail weight.
For both tail-weight and asymmetry, the “moderate” and “extreme” contaminations are defined by mixed normal distributions where moderate contamination (5%, +- 2 std dev) represents at least twice the expected number of observations more than two standard deviations from the mean, and extreme contamination (15%, +- 3 std dev) represents more than 100 times the expected number of observations more than three standard deviations from the mean.
Due to the inherent complexity of typical multinomial data, it is necessary to study several estimates for each distributional characteristic. As Elashoff and Elashoff (1978), discussing estimates of tail weight, note: "No single parameter can summarize the varied meanings of tail length." The same is true of symmetry or its lack (Hill and Dixon, 1982; Gastwirth, 1971). The recent trend toward extensive exploratory data analysis further emphasizes this fact. Therefore, in addition to skewness and kurtosis, the following estimates of symmetry and tail weight from Micceri (1989, p. 158) are computed:
*************************
SYMMETRY ESTIMATES
*************************
Three measures of symmetry/asymmetry are computed:
(1) M/M intervals (Hill and Dixon, 1982), defined as the mean/median interval divided by a robust scale estimate (1.4807 times one-half the interquartile range),
(2) skewness, and
(3) Hogg's (1974) Q2, where:
Q2 = [U(.05) - M(.25)] / [M(.25) - L(.05)]
where U(alpha) [M(alpha), L(alpha)] is the mean of the upper [middle, lower] (N+1)*alpha observations. The inverse of this ratio defines Q2 for the lower tail (designated by a minus sign).
Note that among estimates of asymmetry, Q2 is sensitive to densities in the distant tail, that the third moment skewness estimate is sensitive to one extremely long tail, and that the standardized mean/median distance is sensitive to asymmetry anywhere in the distribution. The third moment skewness estimate tends to underestimate asymmetry, and underestimate it to a greater degree as the level of asymmetry increases (Micceri, 1989).
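For readers who want to check the three symmetry estimates outside the spreadsheet, the following is a minimal Python sketch, not the template's own formulas. All function names are hypothetical. It assumes that the block size (N+1)*alpha is rounded to a whole number, that the central block may hold one extra observation so it stays centered, and that quartiles are ordinary interpolated quartiles rather than the class-interval percentiles the template uses.

```python
import statistics


def block_means(data, alpha):
    """Means of the lowest, central and highest blocks of the sorted data.

    Assumption: the block size k is (N + 1) * alpha rounded to a whole
    number and clamped to 1..N; fractional weighting is not attempted.
    """
    xs = sorted(data)
    n = len(xs)
    k = max(1, min(n, round((n + 1) * alpha)))
    start = (n - k) // 2
    lower = statistics.fmean(xs[:k])
    # The central block holds k or k + 1 values so that it stays centered.
    middle = statistics.fmean(xs[start:n - start])
    upper = statistics.fmean(xs[n - k:])
    return lower, middle, upper


def mm_interval(data):
    """Mean/median interval divided by 1.4807 times one-half the IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    scale = 1.4807 * (q3 - q1) / 2
    return (statistics.fmean(data) - statistics.median(data)) / scale


def skewness(data):
    """Third-moment skewness: m3 / m2 ** 1.5, using divisor-N moments."""
    mean = statistics.fmean(data)
    m2 = statistics.fmean([(x - mean) ** 2 for x in data])
    m3 = statistics.fmean([(x - mean) ** 3 for x in data])
    return m3 / m2 ** 1.5


def hogg_q2(data):
    """Upper-tail Q2 = [U(.05) - M(.25)] / [M(.25) - L(.05)]."""
    lower_05, _, upper_05 = block_means(data, 0.05)
    _, middle_25, _ = block_means(data, 0.25)
    return (upper_05 - middle_25) / (middle_25 - lower_05)
```

For a perfectly symmetric data set, the M/M interval and skewness are 0 and Q2 is 1. The sketch reports only the upper-tail form of Q2 and does not guard against data with zero spread, where the ratios are undefined.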
*************************
TAIL WEIGHT ESTIMATES
*************************
Two different types of tail-weight measure are also computed:
(1) Hogg's (1974) Q and Q1, where:
Q = [U(.05) - L(.05)] / [U(.50) - L(.50)]
Q1 = [U(.20) - L(.20)] / [U(.50) - L(.50)]
(2) C ratios of Elashoff and Elashoff (1978) - C90, C95 and C975 (the ratio of the 90th, 95th and 97.5th percentile points, respectively, to the 75th percentile point).
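The two kinds of tail-weight ratio can be sketched in Python as follows. This is an illustration under stated assumptions, not the template's cell formulas, and the function names are hypothetical. It assumes that (N+1)*alpha is rounded to a whole number, that percentiles are simple linearly interpolated order statistics, and that the C ratios measure each percentile point as a distance from the median (without some centering, the ratio would change with the location of the scale).

```python
import statistics


def lower_upper_means(xs, alpha):
    """Means of the lowest and highest k values of sorted xs.

    Assumption: k is (N + 1) * alpha rounded and clamped to 1..N.
    """
    n = len(xs)
    k = max(1, min(n, round((n + 1) * alpha)))
    return statistics.fmean(xs[:k]), statistics.fmean(xs[n - k:])


def hogg_q(data, alpha):
    """[U(alpha) - L(alpha)] / [U(.50) - L(.50)].

    alpha = 0.05 gives Q and alpha = 0.20 gives Q1.
    """
    xs = sorted(data)
    low_a, up_a = lower_upper_means(xs, alpha)
    low_h, up_h = lower_upper_means(xs, 0.50)
    return (up_a - low_a) / (up_h - low_h)


def percentile(xs, p):
    """Linearly interpolated percentile of sorted xs, with p in 0..100."""
    pos = (len(xs) - 1) * p / 100
    i = int(pos)
    frac = pos - i
    if i + 1 < len(xs):
        return xs[i] + frac * (xs[i + 1] - xs[i])
    return xs[i]


def c_ratio(data, p):
    """Distance of percentile p from the median, relative to that of P75.

    Assumption: distances are taken from the median; p = 90, 95 and 97.5
    give C90, C95 and C975.
    """
    xs = sorted(data)
    median = percentile(xs, 50)
    return (percentile(xs, p) - median) / (percentile(xs, 75) - median)
```

On the evenly spaced scores 1 to 101, this gives Q = 1.92, Q1 = 1.62, C90 = 1.6, C95 = 1.8 and C975 = 1.9. Heavier tails push all of these ratios upward.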
The Q statistics are sensitive to relative density, and the C statistics to distance (between percentiles). Note that percentiles are computed within class intervals, assuming discrete rather than continuous data. No other readily available packaged statistical application of which I am aware computes percentiles in this way, although it is the only appropriate one. Please write if you desire a copy of the paper describing the rationale and method (Micceri, 1988).
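As an illustration of what computing a percentile within a class interval can look like, here is a Python sketch of the common textbook grouped-data formula. Whether it matches the method of Micceri (1988) in every detail is an assumption; the function name and the width parameter are hypothetical. Each distinct score is treated as an interval of the given width centered on that score, and the percentile is interpolated inside the interval where the cumulative count reaches the target.

```python
from collections import Counter


def grouped_percentile(data, p, width=1.0):
    """Percentile p (0..100) interpolated within class intervals.

    Assumption: each score s stands for the interval
    [s - width / 2, s + width / 2], and observations are spread
    evenly across that interval.
    """
    counts = Counter(data)
    target = p / 100 * len(data)
    cum_below = 0
    for score in sorted(counts):
        freq = counts[score]
        if cum_below + freq >= target:
            lower_limit = score - width / 2
            return lower_limit + (target - cum_below) / freq * width
        cum_below += freq
    # Reached only if rounding pushes the target past the last class.
    return max(counts) + width / 2
```

For the ten scores 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, the median falls in the interval around 3 (2.5 to 3.5) and works out to about 3.17, where an ordinary order-statistic median would report 3.5.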
*************************
VARIANCE ESTIMATES
*************************
To compare variance among different score distributions, two robust variance estimates are included:
Mean Deviation = mean of absolute deviations from the arithmetic mean
SQ = 1.4807 times one-half the interquartile range (Hill & Dixon, 1982)
Both of these are less sensitive to a few extreme cases than the standard deviation. At the Gaussian, SQ closely approximates the standard deviation; the mean deviation as defined above equals about 0.80 standard deviations there, so multiply it by about 1.2533 before comparing. If all three are then close, the standard deviation is an appropriate estimate. However, if they differ, one of the robust estimates is probably preferable when attempting to identify distributions having heterogeneous variances.
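The two robust estimates can be compared with the standard deviation using a short Python sketch such as the one below. The function names are hypothetical, and the quartiles are ordinary interpolated quartiles rather than the template's class-interval percentiles. The rescaled mean deviation is my addition for comparison purposes: at the Gaussian the mean absolute deviation is sqrt(2/pi) times the standard deviation, so multiplying by sqrt(pi/2), about 1.2533, puts it on the same scale.

```python
import math
import statistics


def mean_deviation(data):
    """Mean of absolute deviations from the arithmetic mean."""
    center = statistics.fmean(data)
    return statistics.fmean([abs(x - center) for x in data])


def sq(data):
    """SQ = 1.4807 times one-half the interquartile range."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return 1.4807 * (q3 - q1) / 2


def spread_summary(data):
    """Standard deviation next to the two robust estimates."""
    md = mean_deviation(data)
    return {
        "std_dev": statistics.stdev(data),
        "mean_deviation": md,
        # Assumption: rescaling added here only to aid comparison.
        "mean_deviation_rescaled": md * math.sqrt(math.pi / 2),
        "sq": sq(data),
    }
```

If one extreme score is added to a data set, the standard deviation typically moves much more than either robust estimate, which is the pattern to look for when the three disagree.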
*************************
OTHER ESTIMATES
*************************
Percentile points (P025 to P975), including the 2.5th, 5th, 10th, 25th, 50th, 75th, 90th, 95th and 97.5th percentile points. Note that Quartile 1 = P25, Quartile 2 = P50 and Quartile 3 = P75.
Range and Sample Space, where the range = maximum - minimum and the sample space is defined as the number of different scale points having observations.
Variance and Standard Deviation (as typically defined).
Minimum and Maximum - respectively the lowest and highest obtained score.
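The range and sample space definitions above translate directly into a few lines of Python. This is only a sketch with a hypothetical function name; it treats every distinct value in the data as one scale point.

```python
def range_and_sample_space(data):
    """Minimum, maximum, range and number of distinct scale points used."""
    lowest, highest = min(data), max(data)
    return {
        "minimum": lowest,
        "maximum": highest,
        "range": highest - lowest,
        # Sample space: count of different scale points having observations.
        "sample_space": len(set(data)),
    }
```

For the scores 2, 2, 3, 5, 5, 9 the range is 7 and the sample space is 4, since only the scale points 2, 3, 5 and 9 have observations.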
*************************
REFERENCES
*************************
Elashoff, J. D. and Elashoff, R. M. (1978). Effects of errors in statistical assumptions. In W. H. Kruskal and J. M. Tanur (Eds.), International encyclopedia of statistics (pp. 229-250). New York: The Free Press.
Gastwirth, J. L. (1971). On the sign test for symmetry. Journal of the American Statistical Association, 66, 821-823.
Hill, M. and Dixon, W. J. (1982). Robustness in real life: A study of clinical laboratory data. Biometrics, 38, 377-396.
Hogg, R. V. (1974). Adaptive robust procedures: A partial review and some suggestions for future applications and theory. Journal of the American Statistical Association, 69, 909-927.
Micceri, T. (1988). Discrete, lumpy data; the median, and dislocated computer algorithms. Paper presented at the American Statistical Association Conference, San Antonio, TX, January.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166.
*************************
DATADEF is SHAREWARE
*************************
Please pass it on. If you like it, use it for more than one month, or want updates and revisions, please send between $5 and $15 to the starving author and his large, voracious family:
Theodore Micceri
527 Lantern Circle
Temple Terrace, FL 33617
(813) 988-0056
FAX (813) 974-3493
CIS 70303,1275
Also, please send comments or any improvements you make. Note that I avoided Macros and other simplifying techniques to allow easier transfer across platforms.